Topic-dependent n-gram models based on optimization of context lengths in LDA
Abstract
This paper describes a method that improves the accuracy of N-gram language models and can be applied to online applications. The precision of a long-distance language model, including one based on LDA, depends on the context length, i.e., the length of the history used for prediction. In the proposed method, each of multiple LDA units separately estimates an optimum context length; the resulting predictions are then integrated and N-gram probabilities are calculated. The method thus directly estimates the context length best suited for prediction. Results show that the method improves topic-dependent N-gram probabilities, particularly for words related to specific topics, and yields higher and more stable performance compared to an existing method.
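As a concrete illustration of the idea, the following Python sketch shows one plausible realization: each LDA unit picks the history length that maximizes held-out likelihood, and the units' predictions are then interpolated with a baseline N-gram estimate. The class names, the folding-in shortcut, and the linear-interpolation scheme are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch, assuming a linear-interpolation integration scheme;
# the paper's exact combination of LDA units and N-gram may differ.
import numpy as np

class LDAUnit:
    """Toy stand-in for one trained LDA unit: it truncates the history
    to its own context length, infers a topic mixture, and returns a
    topic-conditioned unigram distribution."""
    def __init__(self, topic_word, context_len):
        self.topic_word = topic_word      # shape (n_topics, vocab_size)
        self.context_len = context_len    # optimized separately per unit

    def infer_theta(self, history):
        # Crude folding-in: average word-topic responsibilities over the
        # truncated history (a real system would run Gibbs/variational
        # inference here).
        ctx = history[-self.context_len:]
        if not ctx:
            n_topics = self.topic_word.shape[0]
            return np.full(n_topics, 1.0 / n_topics)
        theta = self.topic_word[:, ctx].sum(axis=1)
        return theta / theta.sum()

    def word_dist(self, history):
        # P_LDA(w | truncated history) = sum_k theta_k * phi_{k,w}
        return self.infer_theta(history) @ self.topic_word

def select_context_len(topic_word, dev_pairs, candidates):
    """Pick the history length that maximizes held-out log-likelihood,
    mirroring the per-unit estimation of an optimum context length."""
    def loglik(length):
        unit = LDAUnit(topic_word, length)
        return sum(np.log(unit.word_dist(h)[w]) for h, w in dev_pairs)
    return max(candidates, key=loglik)

def adapted_prob(word, history, p_ngram, units, weights):
    """Interpolate the baseline N-gram with each unit's prediction."""
    p = weights[0] * p_ngram
    for lam, unit in zip(weights[1:], units):
        p += lam * unit.word_dist(history)[word]
    return p

# Toy demonstration with random "trained" topics and dev data.
rng = np.random.default_rng(0)
vocab_size, n_topics = 50, 4
phi = rng.dirichlet(np.ones(vocab_size), size=n_topics)
dev = [(list(rng.integers(0, vocab_size, 30)), int(rng.integers(vocab_size)))
       for _ in range(20)]
best = select_context_len(phi, dev, candidates=(2, 5, 10, 20))
units = [LDAUnit(phi, best), LDAUnit(phi, 5)]
print(adapted_prob(word=3, history=list(range(30)), p_ngram=0.02,
                   units=units, weights=[0.6, 0.2, 0.2]))
```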
Similar works
A Probabilistic Topic Model Based on Local Word Relations in Overlapping Windows
A probabilistic topic model assumes that documents are generated through a process involving topics, and then tries to reverse this process: given the documents, it extracts the topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...
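For reference, the standard LDA generative process that this snippet sketches (Blei et al., 2003) can be written as:

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)           & & \text{topic mixture of document } d \\
\phi_k   &\sim \mathrm{Dirichlet}(\beta)            & & \text{word distribution of topic } k \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d)       & & \text{topic of the } n\text{-th word} \\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) & & \text{the observed word}
\end{align*}
```

Reversing the process then amounts to inferring the latent \theta and \phi from the observed words w.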
Style and Topic Language Model Adaptation Using HMM-LDA
Adapting language models across styles and topics, such as for lecture transcription, involves combining generic style models with topic-specific content relevant to the target document. In this work, we investigate the use of the Hidden Markov Model with Latent Dirichlet Allocation (HMM-LDA) to obtain syntactic state and semantic topic assignments to word instances in the training corpus. From...
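As a rough sketch of how such a model factors word probabilities, the code below assumes one designated "semantic" HMM state whose emissions come from document-specific LDA topics, while the remaining states keep their own syntactic emission distributions; the function and variable names are invented for illustration.

```python
# A minimal sketch of the HMM-LDA emission mix, assuming one semantic
# state; this is an illustration, not the paper's adaptation recipe.
import numpy as np

def hmm_lda_word_prob(word, state_post, syn_emissions, theta, phi,
                      semantic_state=0):
    """P(w) = P(semantic state) * sum_k theta_k * phi_{k,w}
            + sum over other states s of P(s) * P(w | s)."""
    p = state_post[semantic_state] * (theta @ phi[:, word])
    for s, p_s in enumerate(state_post):
        if s != semantic_state:
            p += p_s * syn_emissions[s, word]
    return p

rng = np.random.default_rng(1)
vocab, n_topics, n_states = 30, 3, 4
phi = rng.dirichlet(np.ones(vocab), size=n_topics)   # topic-word dists
syn = rng.dirichlet(np.ones(vocab), size=n_states)   # per-state emissions
theta = rng.dirichlet(np.ones(n_topics))             # document topic mix
state_post = rng.dirichlet(np.ones(n_states))        # P(state | context)
print(hmm_lda_word_prob(word=7, state_post=state_post,
                        syn_emissions=syn, theta=theta, phi=phi))
```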
Topic dependent language model based on topic voting on noun history
Language models (LMs) are important in automatic speech recognition systems. In this paper, we propose a new approach to a topic-dependent LM, where the topic is decided in an unsupervised manner. Latent Semantic Analysis (LSA) is employed to reveal hidden (latent) relations among nouns in the context words. To decide the topic of an event, a fixed-size word history sequence (window) is observe...
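The voting step described above might look like the following sketch: each noun in the fixed-size window votes for its most similar topic in the LSA space, and the majority vote selects the topic-dependent LM. The vectors and the cosine similarity measure here are placeholders, not the paper's exact setup.

```python
# Hypothetical sketch of topic voting over a noun-history window;
# the LSA vectors here are random stand-ins for trained ones.
import numpy as np
from collections import Counter

def vote_topic(noun_window, noun_vecs, topic_vecs):
    """Each noun votes for its nearest topic by cosine similarity;
    the majority topic wins."""
    votes = []
    for noun in noun_window:
        v = noun_vecs[noun]
        sims = (topic_vecs @ v) / (
            np.linalg.norm(topic_vecs, axis=1) * np.linalg.norm(v))
        votes.append(int(np.argmax(sims)))
    return Counter(votes).most_common(1)[0][0]

rng = np.random.default_rng(2)
dim, n_topics = 8, 3
nouns = ["market", "stock", "game", "team"]
noun_vecs = {n: rng.normal(size=dim) for n in nouns}
topic_vecs = rng.normal(size=(n_topics, dim))
print(vote_topic(["market", "stock", "team"], noun_vecs, topic_vecs))
```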
Cache-Augmented Latent Topic Language Models for Speech Retrieval
We aim to improve speech retrieval performance by augmenting traditional N-gram language models with different types of topic context. We present a latent topic model framework that treats documents as arising from an underlying topic sequence combined with a cache-based repetition model. We analyze our proposed model both for its ability to capture word repetition via the cache and for its sui...
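One plausible reading of the combination described above is a three-way interpolation of n-gram, topic, and cache estimates, sketched below; the weights and the simple count-based cache are assumptions for illustration.

```python
# Illustrative three-way interpolation; the paper's actual cache and
# integration scheme may differ from this sketch.
from collections import Counter

def cache_prob(word, recent_words, vocab_size):
    """Count-based cache: recently seen words get boosted probability."""
    counts = Counter(recent_words)
    total = sum(counts.values())
    if total == 0:
        return 1.0 / vocab_size
    return counts[word] / total

def combined_prob(word, p_ngram, p_topic, recent_words, vocab_size,
                  lambdas=(0.7, 0.2, 0.1)):
    l_ng, l_tp, l_ca = lambdas
    return (l_ng * p_ngram + l_tp * p_topic
            + l_ca * cache_prob(word, recent_words, vocab_size))

print(combined_prob("speech", p_ngram=0.01, p_topic=0.03,
                    recent_words=["speech", "retrieval", "speech"],
                    vocab_size=10000))
```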
PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names
This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as “topic models” to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of...
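Roughly, the encoding treats each document as a string rewritten by document-specific rules, so that rule probabilities play the roles of the LDA parameters. A hedged sketch of such a PCFG (after Johnson, 2010; the exact rule set in the paper may differ):

```latex
\begin{align*}
\mathrm{Sentence} &\rightarrow \mathrm{Doc}_j
  & & \text{pick a document } j \\
\mathrm{Doc}_j &\rightarrow \mathrm{Doc}_j \; \mathrm{Topic}_i
  & & \text{probability tied to } \theta_{j,i} \\
\mathrm{Doc}_j &\rightarrow \epsilon
  & & \text{stop generating words} \\
\mathrm{Topic}_i &\rightarrow w
  & & \text{probability tied to } \phi_{i,w}
\end{align*}
```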